Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks
Authors
Abstract
We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function. We further show that the trained weights, as a function of the layer index, admit a scaling limit which is Hölder continuous as the depth of the network tends to infinity. The proofs are based on non-asymptotic estimates of the loss and of the norms of the weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.
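To make the setting concrete, the sketch below trains a residual network with constant layer width and a smooth activation by plain full-batch gradient descent on a quadratic loss. The tanh activation, the 1/depth residual scaling, and all hyperparameters are illustrative assumptions, not the exact construction or scaling analyzed in the paper.

```python
# Minimal sketch of the setting in the abstract: a deep residual network with
# constant layer width and a smooth activation, trained by full-batch gradient
# descent on a quadratic loss. The tanh activation, the 1/depth residual
# scaling, and all hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class ConstantWidthResNet(nn.Module):
    def __init__(self, width: int, depth: int):
        super().__init__()
        # One linear map per residual block; every layer has the same width.
        self.blocks = nn.ModuleList(nn.Linear(width, width) for _ in range(depth))
        self.depth = depth

    def forward(self, x):
        # Residual update x_{l+1} = x_l + (1/L) * tanh(W_l x_l + b_l).
        for block in self.blocks:
            x = x + torch.tanh(block(x)) / self.depth
        return x

torch.manual_seed(0)
width, depth, n_samples, lr = 16, 50, 128, 0.1
X = torch.randn(n_samples, width)           # inputs
Y = torch.randn(n_samples, width)           # regression targets

model = ConstantWidthResNet(width, depth)
for step in range(500):
    loss = ((model(X) - Y) ** 2).mean()     # quadratic (mean-squared) loss
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():        # plain gradient descent step
            p -= lr * p.grad
```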
Similar resources
Convergence properties of gradient descent noise reduction
Gradient descent noise reduction is a technique that attempts to recover the true signal, or trajectory, from noisy observations of a non-linear dynamical system for which the dynamics are known. This paper provides the first rigorous proof that the algorithm will recover the original trajectory for a broad class of dynamical systems under certain conditions. The proof is obtained using ideas f...
Handwritten Character Recognition using Modified Gradient Descent Technique of Neural Networks and Representation of Conjugate Descent for Training Patterns
The purpose of this study is to analyze the performance of the backpropagation algorithm with changing training patterns and the second momentum term in feed-forward neural networks. This analysis is conducted on 250 different words of three small letters from the English alphabet. These words are presented to two vertical segmentation programs which are designed in MATLAB and based on portions (1...
Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks
We analyze algorithms for approximating a function $f(x) = \Phi x$ mapping $\Re^d$ to $\Re^d$ using deep linear neural networks, i.e., networks that learn a function $h$ parameterized by matrices $\Theta_1, \ldots, \Theta_L$ and defined by $h(x) = \Theta_L \Theta_{L-1} \cdots \Theta_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution...
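For a concrete picture of this deep linear setting, the sketch below runs gradient descent from identity initialization on a product of matrices so that the end-to-end map approaches a positive definite target; the Frobenius-norm loss used here is the form the population quadratic loss takes for standard Gaussian inputs, and the target, depth, and step size are illustrative assumptions.

```python
# Minimal sketch: gradient descent from identity initialization on factors
# Theta_1, ..., Theta_L so that the end-to-end map Theta_L ... Theta_1
# approaches a positive definite target Phi. The loss is
# 0.5 * ||Theta_L ... Theta_1 - Phi||_F^2; target, depth, and step size
# are illustrative assumptions.
import numpy as np

d, L, lr = 5, 10, 0.05
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
Phi = A @ A.T + d * np.eye(d)                 # a positive definite target
Phi /= np.linalg.norm(Phi, 2)                 # normalize for a stable step size
Thetas = [np.eye(d) for _ in range(L)]        # identity initialization

def product(mats):
    """Multiply factors in network order: later factors act on the left."""
    out = np.eye(d)
    for M in mats:
        out = M @ out
    return out

for step in range(2000):
    R = product(Thetas) - Phi                 # residual of the end-to-end map
    grads = []
    for l in range(L):
        left = product(Thetas[l + 1:])        # factors applied after the current one
        right = product(Thetas[:l])           # factors applied before the current one
        # d/dTheta of 0.5*||left @ Theta @ right - Phi||_F^2 = left^T R right^T
        grads.append(left.T @ R @ right.T)
    for Theta, G in zip(Thetas, grads):
        Theta -= lr * G
```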
Convergence of Stochastic Gradient Descent for PCA
We consider the problem of principal component analysis (PCA) in a streaming stochastic setting, where our goal is to find a direction of approximate maximal variance, based on a stream of i.i.d. data points in R^d. A simple and computationally cheap algorithm for this is stochastic gradient descent (SGD), which incrementally updates its estimate based on each new data point. However, due to the ...
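A minimal sketch of such an incremental update is an Oja-style SGD step that maintains a single unit-norm direction estimate and refines it with each incoming sample; the synthetic data stream and step size below are illustrative assumptions rather than the exact variant analyzed in this reference.

```python
# Minimal sketch of streaming PCA by SGD: an Oja-style update that maintains a
# single unit-norm direction estimate and refines it with each i.i.d. sample.
# The synthetic data stream and step size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n_stream, eta = 20, 10_000, 0.01

top = rng.standard_normal(d)
top /= np.linalg.norm(top)                   # planted top principal direction

w = rng.standard_normal(d)
w /= np.linalg.norm(w)                       # initial direction estimate

for _ in range(n_stream):
    # One sample whose covariance has a dominant direction along `top`.
    x = 3.0 * rng.standard_normal() * top + rng.standard_normal(d)
    w += eta * x * (x @ w)                   # stochastic gradient (Oja) step
    w /= np.linalg.norm(w)                   # project back onto the unit sphere

alignment = abs(w @ top)                     # close to 1 once the top direction is found
print(f"alignment with planted direction: {alignment:.3f}")
```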
Implicit Regularization in Deep Learning
In an attempt to better understand generalization in deep learning, we study several possible explanations. We show that implicit regularization induced by the optimization method plays a key role in the generalization and success of deep learning models. Motivated by this view, we study how different complexity measures can ensure generalization and explain how optimization algorithms can imp...
Journal
Journal title: Social Science Research Network
Year: 2022
ISSN: 1556-5068
DOI: https://doi.org/10.2139/ssrn.4084172